I have learned about the course Introduction to Open Data Science through my PhD studies. I’m exited to start the course but also a bit worried about the workload. The course Moodle page seems to contain a lot of information, so probably will take a while to get a handle on things.
I have previously used R and Rstudio on some courses, but I’m no expert. I think that exercise 1 was too long, and all in all there was too much stuff to be learned about R in the first week. To me, this one really long exercise feels exhausting. It is nice that there are the R codes already embedded, but I feel smaller chunks would be better. If I hadn’t done R before, I would be terrified right now.
date()
## [1] "Sat Nov 26 09:40:37 2022"
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.10
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
students2014 <- read_csv("./data/learning2014.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## gender = col_character(),
## age = col_double(),
## attitude = col_double(),
## deep = col_double(),
## stra = col_double(),
## surf = col_double(),
## points = col_double()
## )
dim(students2014)
## [1] 166 7
str(students2014)
## spc_tbl_ [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
Dataset is created from data: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt
Data is collected from students in course ‘Johdatus
yhteiskuntatilastotieteeseen’ in fall 2014.
Dataset students2014 contains 7 variables with 166 observations: gender, age, attitude, deep, stra, surf, and points.
All variables in the original data related to strategic, deep, and surface learning have been combined into three variables (stra, deep, and surf, respectively), and scaled to original scales by taking the mean.
Attitude tells about student’s global attitude towards statistics. Variable has been scaled back to the original scale of the questions, by dividing it with the number of questions.
Points tells the course exam points. Observations with points equaling 0 have been excluded, since those students didn’t attend the course exam.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a plot matrix with ggpairs()
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
summary(factor(students2014$gender))
## F M
## 110 56
summary(students2014$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 21.00 22.00 25.51 27.00 55.00
summary(students2014$attitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 2.600 3.200 3.143 3.700 5.000
summary(students2014$deep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 3.333 3.667 3.680 4.083 4.917
summary(students2014$stra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.250 2.625 3.188 3.121 3.625 5.000
summary(students2014$surf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 2.417 2.833 2.787 3.167 4.333
summary(students2014$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 23.00 22.72 27.75 33.00
There are 66% (110/166) women in respondents. The median age is 22 years (range 17-55), the median age of women seems to be a bit lower than that of men.
The median attitude is 3.2 (range 1.4-5). The median attitude of men seems higher than that of women.
The median for deep learning is 3.7 (range 1.6-4.9). The median for strategic learning is 3.1 (range 1.3-5). The median for surface learning is 2.8 (range 1.6-4.3). For women, the median for strategic and surface learning seems a bit higher than for men.
The median for exam points is 23 (range 7-33). The median is about the same in men and women.
Attitude correlates with exam points, so that with “better” attitude towards statistics one usually gets higher exam points.
Surface learning correlates negatively with deep and strategic learning as well as attitude. So with better attitude one less often utilizes surface learning. If one utilizes surface learning, one less often utilizes deep or strategic learning.
#fit a linear regression model using points as the outcome and attitude, strategig and surface learning as explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
#remove surface learning from the model, since it's so far from significant
my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
#error rate
sigma(my_model2)*100/mean(students2014$points)
## [1] 23.28148
First we choose three variables that correlate the most with points in above visualization (namely attitude, strategic and surface learning). However, after fitting the model, we see that for surface learning there is a really high p-value (0.466), so we remove that and fit the model again with just attitude and strategic learning.
In the new model, p-value of the F-statistic is 7.734e-09. So at least one of the explanatory variables is significantly associated with exam points.
For attitude, p-value is <0.001, so it is significantly associated with exam points. According to this model, if attitude increases by 1, exam points increase by approximately 3.5.
For strategic learning, p-value is 0.089. It isn’t <0.05 that’s usually deemed significant, however, it is quite close. So we probably don’t want to state that strategic learning has no effect on exam points. On the other hand, the estimated intercept for strategic learning is 0.9, so a change in attitude makes a much bigger difference than a change in strategic learning.
The adjusted R-squared is 0.1951, which means that attitude (and strategic learning) only explain around 20% of the variance in exam points.
The residual standard error is 5.289 which corresponds to 23% prediction error rate.
plot(my_model2, which=c(1,2,5))
Linear regression assumes a linear relationship between predictors and outcome, that residual errors are normally distributed, that residuals have a constant variance, and independence of residual error terms.
With Residuals vs. Fitted plot we can check the linearity of data. Here, the red line seems to be horizontal at approximately zero, where it should be. So we can assume a linear relationship.
With plot Normal Q-Q we can check if residual errors are normally distributed. The plot of residuals should more or less follow the dashed line. Here they mostly do so, with some exceptions at the upper and lower ends. However, we may assume normality.
The Residuals vs. Leverage plot marks three most extreme points (35, 71, 145). Two of them exceed 3 standard deviations, so they are possible outliers.
date()
## [1] "Sat Nov 26 09:40:48 2022"
library(tidyverse)
alc <- read_csv("./data/alc.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## age = col_double(),
## Medu = col_double(),
## Fedu = col_double(),
## traveltime = col_double(),
## studytime = col_double(),
## famrel = col_double(),
## freetime = col_double(),
## goout = col_double(),
## Dalc = col_double(),
## Walc = col_double(),
## health = col_double(),
## failures = col_double(),
## absences = col_double(),
## G1 = col_double(),
## G2 = col_double(),
## G3 = col_double(),
## alc_use = col_double(),
## high_use = col_logical()
## )
## i Use `spec()` for the full column specifications.
dim(alc)
## [1] 370 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Dataset is created from data: https://archive.ics.uci.edu/ml/datasets/Student+Performance
The data approaches student achievement in secondary education of two
Portuguese schools, and the dataset joins two student alcohol
consumption datasets. The variables not used for joining the two data
have been combined by averaging (including the grade variables).
‘alc_use’ is the average of ‘Dalc’ and ‘Walc’. ‘high_use’ is TRUE if
‘alc_use’ is higher than 2 and FALSE otherwise.
Dataset alc contains 35 variables with 370 observations.
I think interesting variables in relation to alcohol consumption are
sex, number of past class failures, final grade (G3) and number of
school absences.
My hypothesis for these variables are:
1. Men’s alcohol consumption is higher than women’s.
2. Those who consume more alcohol have more past class failures compared
to those who consume less alcohol.
3. Those who consume more alcohol have lower final grades compared to
those who consume less alcohol.
4. Those who consume more alcohol have more school absences compared to
those who consume less alcohol.
library(dplyr); library(ggplot2); library(GGally); library(finalfit)
dependent <- "high_use"
explanatory <- c("sex", "failures", "G3", "absences")
alc %>%
summary_factorlist(dependent, explanatory, p = TRUE,
add_dependent_label = TRUE)
## Warning in chisq.test(failures, high_use): Chi-squared approximation may be
## incorrect
## Dependent: high_use FALSE TRUE p
## sex F 154 (59.5) 41 (36.9) <0.001
## M 105 (40.5) 70 (63.1)
## failures 0 238 (91.9) 87 (78.4) 0.003
## 1 12 (4.6) 12 (10.8)
## 2 8 (3.1) 9 (8.1)
## 3 1 (0.4) 3 (2.7)
## G3 Mean (SD) 11.8 (3.4) 10.9 (3.0) 0.011
## absences Mean (SD) 3.7 (4.5) 6.4 (7.1) <0.001
#median and quartiles of the numeric variables
summary(alc$failures)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1892 0.0000 3.0000
summary(alc$G3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 12.00 11.52 14.00 18.00
summary(alc$absences)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 4.511 6.000 45.000
#draw a barplot of alcohol consumption by sex
ggplot(data = alc, aes(x = high_use, fill=sex)) + geom_bar()
#produce summary statistics of final grade by alcohol consumption and sex
alc %>% group_by(sex, high_use) %>% summarise(mean_grade=mean(G3),count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use mean_grade count
## <chr> <lgl> <dbl> <int>
## 1 F FALSE 11.4 154
## 2 F TRUE 11.8 41
## 3 M FALSE 12.3 105
## 4 M TRUE 10.3 70
#draw a boxplot of final grade by alcohol consumption and sex
ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab("grade")
#produce summary statistics of school abscences by alcohol consumption and sex
alc %>% group_by(sex, high_use) %>% summarise(mean_absences=mean(absences),count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use mean_absences count
## <chr> <lgl> <dbl> <int>
## 1 F FALSE 4.25 154
## 2 F TRUE 6.85 41
## 3 M FALSE 2.91 105
## 4 M TRUE 6.1 70
#draw a boxplot of school absences by alcohol consumption and sex
ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
There are 195 women and 175 men. Of those, 21% (41/195) and 40% (70/175) have high alcohol consumption, respectively. This supports my hypothesis of men consuming more alcohol than women.
The mean of past class failures is 0.2 (range 0-3). Of high users, 3% (3/111) have 3 failures, 8% (9/111) have 2 failures, and 11% have 1 failure. Of others, 0.4% (1/259) have 3 failures, 3% (8/259) have 2 failures, and 5% have 1 failure. Thus it would seem that students with high alcohol consumption have failed class more often than those with low alcohol consumption.
The mean final grade is 11.5 (range 0-18). Men with high alcohol consumption seem to have lower final grades than people with low alcohol consumption. Interestingly, women with high alcohol consumption seem to have around the same final grades than people with low alcohol consumption. So my hypothesis is supported only by observations of men.
The mean number of absences is 4.5 (range 0-45, median 3). The median of school absences is higher for men with high alcohol consumption than other groups. However, if we look at the mean, women with high alcohol consumption have the highest mean of absences, and both men and women with high alcohol consumption have higher mean than those with low alcohol consumption. This is somewhat in line with my hypothesis.
# fit a logistic regression model
my_model <- glm(high_use ~ sex + failures + absences + G3, data = alc, family = "binomial")
# print out a summary of the model
summary(my_model)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1561 -0.8429 -0.5872 1.0033 2.1393
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.38733 0.51617 -2.688 0.00719 **
## sexM 1.00870 0.24798 4.068 4.75e-05 ***
## failures 0.50382 0.22018 2.288 0.02213 *
## absences 0.09058 0.02322 3.901 9.56e-05 ***
## G3 -0.04671 0.03948 -1.183 0.23671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 405.59 on 365 degrees of freedom
## AIC: 415.59
##
## Number of Fisher Scoring iterations: 4
# fit a new model without G3 model
my_model2 <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(my_model2)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1550 -0.8430 -0.5889 1.0328 2.0374
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.94150 0.23129 -8.394 < 2e-16 ***
## sexM 0.99731 0.24725 4.034 5.49e-05 ***
## failures 0.59759 0.20698 2.887 0.00389 **
## absences 0.09245 0.02323 3.979 6.91e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 406.99 on 366 degrees of freedom
## AIC: 414.99
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(my_model2) %>% exp
# compute confidence intervals (CI)
CI <- confint(my_model2) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1434892 0.08940913 0.2219242
## sexM 2.7109881 1.67967829 4.4367282
## failures 1.8177381 1.21997630 2.7642305
## absences 1.0968598 1.05041937 1.1508026
In the first model, final grade has a p-value 0.2, and thus isn’t associated with high alcohol consumption. Let’s fit the model again without it. In the second model, all variables are associated with the outcome, and also the AIC drops from 415.59 to 414.99, which indicates that G3 can be dropped from the model.
According to this model, high alcohol consumption is associated with
being male (OR 2.7, 95% CI 1.7-4.4, p-value <0.001), having failed
class (OR 1.8, 95% CI 1.2-2.8, p-value 0.004), and school absences (OR
1.1, 95% CI 1.1-1.2, p-value <0.001).
My primary hypothesis were correct regarding being male and having
failed classes. However, there is only slight association with school
absence (OR 1.1), and the final grade doesn’t seem to be associated with
high alcohol consumption.
# predict() the probability of high_use
probabilities <- predict(my_model2, type = "response")
library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 7
## TRUE 78 33
# draw a plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability, y = high_use)) + geom_point()
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297
In the 2x2 table we can see that 252 low and 33 high alcohol
consumers are predicted correctly. In the predictions, however, there
are 78 false negatives and 7 false positives. According to the training
error, 23% of the individuals are classified inaccurately. This is
better than tossing a coin (where the chances would be 50/50), but I
guess with some background information one could get quite close with
educated guesses.
## 10-fold cross-validation of the model
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405
#Let's try the same with another model.
my_model3 <- glm(high_use ~ traveltime + sex + absences +studytime + goout, data = alc, family = "binomial")
summary(my_model3)
##
## Call:
## glm(formula = high_use ~ traveltime + sex + absences + studytime +
## goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8435 -0.7908 -0.4914 0.6936 2.5211
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.72342 0.67103 -5.549 2.88e-08 ***
## traveltime 0.35147 0.18157 1.936 0.052908 .
## sexM 0.81643 0.27225 2.999 0.002710 **
## absences 0.07919 0.02293 3.454 0.000552 ***
## studytime -0.40974 0.17327 -2.365 0.018041 *
## goout 0.71830 0.12152 5.911 3.40e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 363.51 on 364 degrees of freedom
## AIC: 375.51
##
## Number of Fisher Scoring iterations: 4
prob2 <- predict(my_model3, type = "response")
loss_func(class = alc$high_use, prob = prob2)
## [1] 0.2027027
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = my_model3, K = 10)
cv2$delta[1]
## [1] 0.2108108
The prediction error of the model when using 10-fold cross-validation
is around 0.24, which is slightly less than in the exercise set.
With the other tested model (explanatory variables being sex, home to
school travel time, number of school absences, weekly study time, and
going out with friends), the prediction error with 10-fold
cross-validation is around 0.22, which is still lower.
date()
## [1] "Sat Nov 26 09:40:53 2022"
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# check structure and dimensions
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The dataset contains housing values in suburbs of Boston. It can be downloaded from R’s MASS package, and contains 506 observations of 14 variables. More information on the data and variables can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
library(tidyr)
# check summary of data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# draw pairs plot of the variables
pairs(Boston)
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
cor_matrix %>% round(2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")
The summary of the data shows means, medians and ranges of the different variables, for example median number of rooms per dwelling is 6.2 (range 3.6-8.8).
Tax rate and accessibility to radial highways seem to be strongly positively correlated. Median value and lower status of the population are negatively correlated. The plotted correlation matrix shows also other positive and negative correlations. The darker blue a sphere is, the stronger the positive correlation between variables, and the darker red a sphere is, the stronger the negative correlation.
library(dplyr)
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, label = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
The variables are now transformed on the same scale, so now we can compare them.
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2623762 0.2475248 0.2500000
##
## Group means:
## zn indus chas nox rm age
## low 1.0623042 -0.9737973 -0.15056308 -0.8995018 0.4601568 -0.8819699
## med_low -0.1055932 -0.2299906 -0.01233188 -0.5406381 -0.1296533 -0.3484640
## med_high -0.4006284 0.2382999 0.23949396 0.4386949 0.0837805 0.4276585
## high -0.4872402 1.0171306 -0.03844192 1.0355576 -0.3292932 0.8111877
## dis rad tax ptratio black lstat
## low 0.9485473 -0.6965304 -0.7471819 -0.49327095 0.37446906 -0.76119547
## med_low 0.3266006 -0.5582386 -0.4699227 -0.07576364 0.31999182 -0.12890466
## med_high -0.4189792 -0.4030438 -0.2802181 -0.24828966 0.07994062 0.01203417
## high -0.8541998 1.6379981 1.5139626 0.78062517 -0.75845061 0.88700601
## medv
## low 0.527191704
## med_low 0.007100809
## med_high 0.145284175
## high -0.663589680
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09274900 0.713399265 -0.87881825
## indus 0.07315641 -0.340287926 0.46396636
## chas -0.10129945 -0.069721138 0.09819383
## nox 0.37148040 -0.632326731 -1.44874037
## rm -0.07867775 -0.050032220 -0.17057754
## age 0.20779845 -0.198451832 -0.23755017
## dis -0.09273396 -0.136042584 0.11668886
## rad 3.18099891 0.864673450 0.08899972
## tax -0.01732131 0.059112429 0.39569648
## ptratio 0.10182160 0.014060525 -0.29473522
## black -0.10318516 -0.006286222 0.10384031
## lstat 0.17357363 -0.120483235 0.31729751
## medv 0.15905225 -0.274484908 -0.20223647
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9448 0.0405 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 16 0 0
## med_low 1 17 2 0
## med_high 0 10 15 1
## high 0 0 0 26
It would seem that with high crime rate the predictions are most accurate. With low to medium and medium to high there is most inaccuracy.
library(ggplot2)
data(Boston)
boston_scaled2 <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled2, centers = 3)
# plot part of the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)
# determine the maximum number of clusters
k_max <- 10
# calculate the total within sum of squares
set.seed(123)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
# plot part of the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)
From plotting the total within sum of squares we see that around two the value changes quite a lot, so the appropriate number of cluster would be two. The data seems to divide nicely between two clusters according to most variables.
data(Boston)
boston_scaled3 <- as.data.frame(scale(Boston))
# k-means clustering
set.seed(123)
km2 <- kmeans(boston_scaled3, centers = 3)
# linear discriminant analysis
lda.fit2 <- lda(km2$cluster ~., data = boston_scaled3)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km2$cluster)
# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 4)
Variables age, black, and tax seem to be the most influential linear separators for the clusters.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
lda.fit3 <- lda(crime ~., data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit3$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit3$scaling
matrix_product <- as.data.frame(matrix_product)
classes <- as.numeric(train$crime)
train2 <- dplyr::select(train, -crime)
set.seed(123)
km3 <- kmeans(train2, centers = 4)
clusters <- as.numeric(km3$cluster)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=classes)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=clusters)
The cluster that is on the left-hand side seems quite similar in both plots and is quite well defined. In the second plot the clusters are more defined, in the first plot the rest of the clusters mix more with each other.
(more chapters to be added similarly as we proceed with the course!)